Home
AdvExplore
AdvGLUE
AdvGLUE
The Adversarial GLUE Benchmark
Performance of RoBERTa (single model) on AdvGLUE
Overall Statistics
96.0
Accuracy
61.1
51.4
70.3
58.5
92.0
89.4
F1
Accuracy
64.7
51.5
48.6
9.5
24.6
14.0
57.1
41.8
94.1
Accuracy
62.5
49.2
28.5
52.5
86.6
Accuracy
42.3
49.3
45.4
89.7
Accuracy
55.8
43.1
50.8
89.9
0
100
Accuracy
58.0
0
100
36.5
0
100
22.5
0
100
39.6
0
100
GLUE Dev
AdvGLUE Word
AdvGLUE Sentence
AdvGLUE Human
AdvGLUE Overall
SST-2
QQP
QNLI
RTE
MNLI-m
MNLI-mm
plotly-logomark
Performance of RoBERTa (single model) on each task
The Stanford Sentiment Treebank (SST-2)
56.2
69.2
66.7
57.7
56.3
Typo
Knowledge
Embedding
Context
Composition
45.0
61.1
Syntactic
Distraction
70.3
0
100
CheckList
Adversarial Acc
Word
Sentence
Human
plotly-logomark
Quora Question Pairs (QQP)
64.0
82.4
69.0
60.0
62.9
Typo
Knowledge
Embedding
Context
Composition
47.1
66.7
38.1
22.2
58.2
48.6
Syntactic
9.5
24.6
0
100
CheckList
14.0
0
100
Adversarial Acc
Adversarial F1
Word
Sentence
Human
plotly-logomark
MultiNLI (MNLI) matched
59.3
60.7
50.0
57.3
54.1
Typo
Knowledge
Embedding
Context
Composition
42.3
44.4
0
100
Syntactic
Distraction
Adversarial Acc
Word
Sentence
plotly-logomark
MultiNLI (MNLI) mismatched
44.6
67.2
70.4
65.5
56.4
Typo
Knowledge
Embedding
Context
Composition
31.1
45.9
Syntactic
Distraction
16.6
28.4
0
100
StressTest
ANLI
Adversarial Acc
Word
Sentence
Human
plotly-logomark
Question NLI (QNLI)
68.1
59.2
53.4
59.2
68.3
Typo
Knowledge
Embedding
Context
Composition
39.3
62.9
Syntactic
Distraction
35.1
23.8
0
100
CheckList
AdvSQuAD
Adversarial Acc
Word
Sentence
Human
plotly-logomark
Recognizing Textual Entailment (RTE)
47.8
35.5
48.8
29.6
45.5
Typo
Knowledge
Embedding
Context
Composition
46.6
54.2
0
100
Syntactic
Distraction
Adversarial Acc
Word
Sentence
plotly-logomark
AdvGLUE
UIUC Secure Learning Lab
Microsoft Research